Analysis of Phonetic Matching Approaches for Indic Languages

Authors

  • Sandeep Chaware
  • Srikantha Rao
Abstract

Phonetic matching plays an important role in multilingual information retrieval, where data is handled in multiple languages. Users need information in their local language, which may differ from the language in which the data is maintained. In such an environment, we need a system that matches strings phonetically, either exactly or approximately, irrespective of errors. Many kinds of errors and variations could be considered; here we consider typographical errors, spelling errors that differ in a vowel, and the matching of compound words. Many approaches have been proposed, such as Soundex, Q-gram and Phonix, but they may produce ambiguous matches or may not be applicable to Indian languages. In this paper, we propose approaches that match strings in either Hindi or Marathi accurately. We evaluated three approaches, namely Soundex, Q-gram and Indic-Phonetic, by generating test cases such as length of string (LOS), difference in vowel, and compound words for Hindi and Marathi. We found the Indic-Phonetic approach to be more efficient and accurate than the other two approaches.

Keywords: Soundex, Q-gram, Indic-phonetic, threshold, phonetic matching.
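To make the two baseline techniques named in the abstract concrete, the following is a minimal sketch of the classic American Soundex code and a bigram (q = 2) Dice similarity with a threshold. It is not the paper's implementation: the function names, the `#` padding character, and the 0.5 threshold are assumptions chosen for illustration, and the Indic-Phonetic approach itself is not reproduced because the abstract does not describe its rules.

```python
from collections import Counter

# Classic English Soundex letter groups (illustrative baseline only;
# this is NOT the Indic-Phonetic mapping evaluated in the paper).
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character American Soundex code of an ASCII word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def qgrams(text, q=2):
    """Multiset of q-grams of text, padded with '#' (assumed pad symbol)."""
    padded = "#" * (q - 1) + text + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def qgram_similarity(a, b, q=2):
    """Dice coefficient over the q-gram multisets of a and b, in [0, 1]."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0
    shared = sum((grams_a & grams_b).values())
    return 2 * shared / total


def is_match(a, b, threshold=0.5, q=2):
    """Approximate match: similarity at or above a hypothetical threshold."""
    return qgram_similarity(a, b, q) >= threshold


if __name__ == "__main__":
    print(soundex("Robert"), soundex("Rupert"))
    print(qgram_similarity("night", "nacht"))
```

Because `qgrams` works on Unicode code points, it can be applied to Hindi or Marathi strings unchanged, whereas this Soundex table only covers the Latin alphabet, which illustrates why a script-specific mapping is needed for Indic text.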

Similar resources

A Text Input Scheme for Indic Languages with Large Numbers of Printable Characters

This paper discusses design and development of a text-input scheme for phonetic Brahmic languages with a large number of printable characters. We devise an input scheme for an exemplar Indic language with the understanding that the findings are generalizable to other Indic languages. Our results show that a casual user is able to type at a reasonable speed with our approach.

Comparison of Spectral Methods for Spoken Language Identification

Automatic spoken language identification is the task of identifying a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...

An architecture for the shaping of Indic texts

There has been virtually no software localization into any of the major languages of India. One important reason for this is that enabling Indic scripts at the base software level involves sophisticated computational processes that are far from the traditional font-level substitutions that suffice for a number of other world languages. Indian languages are basically phonetic in nature an...

On the Utility of a Syllable-like Segmentation for Learning a Transliteration from English to an Indic Language

Source and target word segmentation and alignment is a primary step in the statistical learning of a transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found u...

The Festvox Indic Frontend for Grapheme-to-Phoneme Conversion

Text-to-Speech (TTS) systems convert text into phonetic pronunciations which are then processed by Acoustic Models. TTS frontends typically include text processing, lexical lookup and Grapheme-to-Phoneme (g2p) conversion stages. This paper describes the design and implementation of the Indic frontend, which provides explicit support for many major Indian languages, along with a unified framewor...

Publication date: 2012